Data Download


  • My choice of dataset is GSE209552.
  • Note: GSE209552 is single nucleus RNA-seq (snRNA-seq), a varient of scRNA-seq that only sequence nucleus RNA.
# Define global constant
GSE_NUM <- "GSE209552"

GEOmetadb setup

To setup GEOmetadb Tutorial, follow GEOmetadb Tutorial Chapter 3

# Install required package if not installed
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!requireNamespace("GEOmetadb", quietly = TRUE))
  BiocManager::install("GEOmetadb")
if (!requireNamespace("DBI", quietly = TRUE))
  install.packages("DBI")
if (!requireNamespace("RSQLite", quietly = TRUE))
  install.packages("RSQLite")
if (!requireNamespace("dplyr", quietly = TRUE))
  install.packages("dplyr")
if (!requireNamespace("ggplot2", quietly = TRUE))
  install.packages("ggplot2")

# Attach packages
library(GEOmetadb)
library(DBI)
library(RSQLite)

# Download SQLite database if not exist
if (!file.exists('GEOmetadb.sqlite')) getSQLiteFile()
file.info('GEOmetadb.sqlite')

# Connect to database
con <- dbConnect(SQLite(), 'GEOmetadb.sqlite')

Download whole GSE209552

Download data, follow Identifier Mapping Tutorial. The helper function defined below is modified based on Identifier Mapping Tutorial’s helper function to avoid duplicated download.

# Define helper function
fetch_geo_supp <- function(gse, destdir = "data") {
  # Calculate the actual path where GEOquery puts the files
  target_path <- file.path(destdir, gse)
  
  # Check if directory exists AND contains files
  if (dir.exists(target_path) && length(list.files(target_path)) > 0) {
    message(paste("Data found at", target_path, "- Skipping download."))
    return(invisible(target_path))
  }
  
  # If missing or empty, proceed with download
  message(paste("Downloading", gse, "to", destdir, "..."))
  dir.create(destdir, showWarnings = FALSE, recursive = TRUE)
  GEOquery::getGEOSuppFiles(GEO = gse, baseDir = destdir, makeDirectory = TRUE)
  
  invisible(target_path)
}

# Download
fetch_geo_supp(gse = GSE_NUM)

Construct snRNA-seq only reference list

This GSE contain both snRNA-seq and bulk RNA-seq sample. Removal of bulk RNA-seq is needed to avoid significant batch effect due to their technical difference. Although the number of samples is reduced to 17, it’s still sufficient amount.

Refer to the original paper’s method section, only the snRNA-seq was done by Illumina NextSeq6000 platform, GPL24676. We can use this clue to exclude all bulk RNA-seq.

Since all GSMs of this study are packed in one GSE209552_RAW.tar file in GEO, we can’t indicate which platform to include/exclude at download step. Thus we will construct a list of GPL24676 GSMs for downstream reference by query GEOmetabd.

# Define desired platform
snRNA_platform <- "GPL24676"

# Construct the query
# We join the 'link' table (gse_gsm) with the 'sample' table (gsm) 
# to check the platform (gpl) column.
sql_filter <- paste0(
  "SELECT t1.gsm ",
  "FROM gse_gsm t1 ",
  "JOIN gsm t2 ON t1.gsm = t2.gsm ",
  "WHERE t1.gse = '", GSE_NUM, "' ",
  "AND t2.gpl = '", snRNA_platform, "'"
)

# Get the list of pure snRNA-seq samples
sn_gsm_list <- dbGetQuery(con, sql_filter)$gsm

# View results
print(paste("Found", length(sn_gsm_list), "snRNA-seq samples."))
[1] "Found 17 snRNA-seq samples."
sn_gsm_list
 [1] "GSM6376803" "GSM6376804" "GSM6376805" "GSM6376806"
 [5] "GSM6376807" "GSM6376811" "GSM6376812" "GSM6376813"
 [9] "GSM6376814" "GSM6376815" "GSM6376816" "GSM6376817"
[13] "GSM6376818" "GSM6376819" "GSM6376820" "GSM6376821"
[17] "GSM6376822"

Access data quality

# Install Seurat if not
if (!requireNamespace("Seurat", quietly = TRUE))
  install.packages("Seurat")

Reorganize snRNA-seq files

The raw count data was stored in GSE209552_RAW.tar file, thus need to be unpacked using untar() function.

# Attach packages
library(Seurat)
library(data.table)
library(Matrix)
library(dplyr)
library(stringr)

# Define data paths
tar_file <- "data/GSE209552/GSE209552_RAW.tar"
extract_dir <- "data/GSE209552/raw_files"

# Define unpack helper function
unpack_geo_tar <- function(tar_path, dest_dir) {
  if (dir.exists(dest_dir) && length(list.files(dest_dir)) > 0) {
    message(paste("📂 Extracted files found at", dest_dir, "- Skipping unpack."))
    return(invisible(dest_dir))
  }
  if (!file.exists(tar_path)) {
    stop(paste("❌ Tar file not found:", tar_path))
  }
  message(paste("📦 Unpacking", basename(tar_path), "..."))
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  untar(tar_path, exdir = dest_dir)
  message("✅ Unpacking complete.")
  return(invisible(dest_dir))
}

# Run the unpacker
unpack_geo_tar(tar_file, extract_dir)

The unpacked file is very messy, contain all 10x genomics snRNA-seq files, bulk RNA-seq files, subsetted Transposable Elements file, clustered files, and normalized files. Among those, only 10x genomics snRNA-seq is what we need.

To extract, we utilize previously defined sn_gsm_list, loop over it, exclude any file contain _TE_, indicate Transposable Elements, only extract files end with either barcode, matrix, or features. Lastly, trim the file name to barcode, matrix, features only, to meet the expected naming format of Seurat’s Read10X() function.

# Reorganize all snRNA-seq 10x files in to a new directory
dest_dir <- "data/GSE209552/10x_organized"    # Where they should go
dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)

# --- MAIN LOOP ---
message(paste("🚀 Organizing files for", length(sn_gsm_list), "samples..."))

skipped_count <- 0
moved_count   <- 0

for (gsm in sn_gsm_list) {
  
  # 1. Define the specific destination path for this sample
  sample_subdir <- file.path(dest_dir, gsm)
  
  # 2. CHECK: Does it already exist?
  if (dir.exists(sample_subdir) && length(list.files(sample_subdir)) >= 3) {
    skipped_count <- skipped_count + 1
    next
  }
  
  # 3. Create directory
  dir.create(sample_subdir, showWarnings = FALSE)
  
  # 4. Find source files (Exclude the 'TE' files)
  pattern <- paste0("^", gsm)
  files_to_move <- list.files(extract_dir, pattern = pattern, full.names = TRUE)
  files_to_move <- files_to_move[!grepl("_TE_", files_to_move)]
  
  # Safety Checks
  if (length(files_to_move) == 0) {
    warning(paste("⚠️ Skipping", gsm, "- No source files found."))
    next
  }
  
  # 5. MOVE & RENAME
  # We identify specific files to ensure we rename them correctly for Seurat
  f_matrix   <- files_to_move[grep("matrix", files_to_move, ignore.case = TRUE)]
  f_barcodes <- files_to_move[grep("barcodes", files_to_move, ignore.case = TRUE)]
  f_features <- files_to_move[grep("features|genes", files_to_move, ignore.case = TRUE)]
  
  # Verify we found exactly 1 of each before moving
  if (length(f_matrix) == 1 && length(f_barcodes) == 1 && length(f_features) == 1) {
    
    # Rename Matrix -> matrix.mtx.gz
    file.rename(f_matrix, file.path(sample_subdir, "matrix.mtx.gz"))
    
    # Rename Barcodes -> barcodes.tsv.gz
    file.rename(f_barcodes, file.path(sample_subdir, "barcodes.tsv.gz"))
    
    # Rename Genes/Features -> features.tsv.gz
    file.rename(f_features, file.path(sample_subdir, "features.tsv.gz"))
    
    message(paste("  ✅ Standardized & Moved:", gsm))
    moved_count <- moved_count + 1
    
  } else {
    warning(paste("  ⚠️ Skipping", gsm, "- Could not find unique 10x triplet to standardize."))
  }
}

# --- SUMMARY ---
message("🎉 Organization Complete.")
message(paste("  🔹 Samples moved & standardized:", moved_count))
message(paste("  🔸 Samples skipped (already done):", skipped_count))

Load snRNA-seq files into R using Seurat

Loading of expression data is done by iteratively call Seurat’s Read10X() and CreateSeuratObject() function over all GSMs. They cooperatively first read a 10X genomics structured file and create a Seurat Object, which in the basic unit for any data manipulation in Seurat.

# Get the list of sample subdirectories
sample_dirs <- list.dirs(dest_dir, recursive = FALSE)

# Define a list storing seurat obj of all GSMs
seurat_list <- list()

message(paste("🚀 Loading", length(sample_dirs), "samples..."))
for (dir in sample_dirs) {
  # Extract GSM ID from the folder name (e.g., "GSM6385438")
  gsm_id <- basename(dir)
  
  # A. Read 10x Data
  # Because you renamed the files, Read10X finds them automatically!
  counts <- Read10X(data.dir = dir)
  
  # B. Create Seurat Object
  seurat_list[[gsm_id]] <- CreateSeuratObject(
    counts = counts, 
    project = gsm_id, 
    min.cells = 0, 
    min.features = 0
  )
  
  message(paste("  ✅ Loaded:", gsm_id))
}
length(seurat_list)
[1] 17

However, the metadata (contain condition information) need to fetched from GEOmetadb, and extract from GSM table’s characteristics_ch1 column, and then insert into corresponding Seurat object’s meta.data layer.

# Load GSMs metadata from GEOmetadb
gsm_list_string <- paste(paste0("'", sn_gsm_list, "'"), collapse = ",")
sql <- paste0(
  "SELECT gsm, title, characteristics_ch1 ",
  "FROM gsm ",
  "WHERE gsm IN (", gsm_list_string, ")"
)
metadata <- dbGetQuery(con, sql)

# Merge the condition metadata to corresponding Seurat obj's metadata
metadata$Condition <- sub(".*condition:\\s*([^;]+).*", "\\1", metadata$characteristics_ch1, ignore.case = TRUE)

# Loop and assign
for (gsm in names(seurat_list)) {
  # Find the condition for this GSM
  val <- metadata$Condition[metadata$gsm == gsm]
  
  # Assign to Seurat object (handling missing values)
  if (length(val) > 0) {
    seurat_list[[gsm]]$Condition <- val
  } else {
    seurat_list[[gsm]]$Condition <- "Unknown"
  }
}

# Final Merge all samples into one seurat obj
combined_seurat <- merge(
 x = seurat_list[[1]], 
 y = seurat_list[-1], 
 add.cell.ids = names(seurat_list), 
 project = "GSE209552"
)
# Verify
table(combined_seurat$Condition)

control     tbi 
  10666   13955 
# Delete `seurat_list` to free memory
rm(seurat_list)

Data exploriation

Plot the following statistics:

  • Number of unique identifier (infer coverage)
  • Number of total amount of identifier (infer depth)
  • Number of samples
  • Number of cells
library(ggplot2)
library(patchwork)

theme_set(theme_classic(base_size = 26))

# 1. Coverage (nFeature_RNA)
p_cov <- VlnPlot(
  combined_seurat, 
  features = "nFeature_RNA", 
  group.by = "Condition", 
  pt.size = 0
) + 
  ggtitle("Coverage") + 
  ylab("nFeature_RNA")

# 2. Depth (nCount_RNA) with adjusted axis
p_depth <- VlnPlot(
  combined_seurat, 
  features = "nCount_RNA", 
  group.by = "Condition", 
  pt.size = 0
) + 
  ggtitle("Depth") + 
  ylab("nCount_RNA") +
  scale_y_continuous(limits = c(0, 50000)) 

# Combine them to recreate p_metrics
p_metrics <- p_cov | p_depth

# Number of cells among conditions
meta_data <- combined_seurat@meta.data

# A. Bar plot: Total Number of Cells per Condition
p_cells <- ggplot(meta_data, aes(x = Condition, fill = Condition)) +
  geom_bar(color = "black") +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  labs(title = "Total Cells per Group", y = "Number of Cells", x = NULL) +
  theme(legend.position = "none")

# Number of samples among conditions
p_samples <- meta_data %>%
  group_by(Condition) %>%
  summarize(Sample_Count = n_distinct(orig.ident)) %>%
  ggplot(aes(x = Condition, y = Sample_Count, fill = Condition)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = Sample_Count), vjust = -0.5) +
  theme_classic() +
  labs(title = "Total Samples (GSMs) per Group", y = "Number of Samples (GSMs)", x = NULL) +
  theme(legend.position = "none")

# Merge plots
final_plot <- p_metrics / (p_samples | p_cells)

# Adjust font size
final_plot <- final_plot & theme_classic(base_size = 16)
final_plot

Figure 1: Overall statistics of dataset GSE209552 grouped by condition (Control v.s. tbi). The upper-left panel indicate the number of unique identifier; The upper-right panel indicate the total amount of identifier, the maximum amount of identifier was limited up to 50000 for the sake of aesthetics; The lower-left panel indicate number of samples (GSMs); The lower-right panel indicate the number of cells”; X-axis of all four panel indicate condition.

Generally, control group have higher depth and coverage, but less sample and cells.

Then plot the statistics below per sample (GSM):

  • Number of unique identifier (infer coverage)
  • Number of total amount of identifier (infer depth)
  • Number of cells

# Define the vertical line style once to keep it consistent
vertical_line <- geom_vline(xintercept = 5.5, linetype = "dashed", color = "red", size = 1)

# 1. Coverage (nFeature_RNA)
p_cov <- VlnPlot(
  combined_seurat, 
  features = "nFeature_RNA", 
  group.by = "orig.ident", 
  pt.size = 0
) + 
  ggtitle("Coverage") + 
  ylab("nFeature_RNA") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none",
    axis.title.x = element_blank() # Often cleaner to remove x-label if text is rotated
  ) + vertical_line

# 2. Depth (nCount_RNA) with adjusted axis
p_depth <- VlnPlot(
  combined_seurat, 
  features = "nCount_RNA", 
  group.by = "orig.ident", 
  pt.size = 0
) + 
  ggtitle("Depth") + 
  ylab("nCount_RNA") +
  scale_y_continuous(limits = c(0, 50000)) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none",
    axis.title.x = element_blank()
  ) + vertical_line

# A. Bar plot: Total Number of Cells per Sample
meta_data <- combined_seurat@meta.data # Ensure meta_data is defined
p_cells <- ggplot(meta_data, aes(x = orig.ident, fill = orig.ident)) +
  geom_bar(color = "black") +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  labs(title = "Total Cells per Sample", y = "Number of Cells", x = NULL) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  ) +
  coord_fixed(ratio = 0.0005)  +
  scale_y_continuous(limits = c(0, 6000)) + vertical_line

# Merge plots
final_plot <- p_cov / p_depth / p_cells + 
  plot_layout(heights = c(1, 1, 1))

# Adjust font size (and re-apply theme elements if needed by the base theme)
final_plot <- final_plot & theme_classic(base_size = 16) & 
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )

final_plot

Figure 2: Overall statistics per sample (GSM). The top panel indicate number of unique identifier; The middle panel indicate number of total identifier; The bottom panel indicate number of cells; X-axis of all three panel indicate samples (GSMs); The vertical red dash line separate control (left side of red line), and tbi (right side of red line).

In general, the sequence depth and coverage is more consistence among control group. Tbi group has more variated depth and coverage with GSM637681 being the sample of lowest depths and coverage. The number of cells within different sample also varies, from highest 3708 (GSM6376822) to lowest 358 (GSM6376820). This imbalance in cell number may result in downstream analysis dominate by sample which contain more cells.

Another common quality control step in scRNA-seq is to remove damaged cells. When damaged cells were sequenced or damaged during sequencing, the expression intensity will scaled down due to leaked RNA, thus significantly bias downstream clustering, annotation, and differential gene expression analysis. A common method to conduct this filtering is using mitochondrial transcript percentage, based on this idea that if a cell has leaked RNA, the mitochondrial transcript percentage will be relatively higher since mitochondria is relatively large and less likely to leak. Empirically, the threshold is set to be 5% ~ 10%

However, when it comes to snRNA-seq, mitochondrial transcript percentage is expected to be 0% because of the exclusion of cytoplasm, and any none-zero mitochondrial transcript percentage indicate incomplete exclusion of cytoplasm Amezquita et al, 2020. Thus we may want to choice a more strict threshold for snRNA-seq, like 1% ~ 2%, to match the wet-lab design while tolerate low-quality cell to certain extent.

# Use seurat PercentageFeatureSet() function
# Create a new entry in meta.data layer storing mitochondrial transcript percentage
combined_seurat[["percent.mt"]] <- PercentageFeatureSet(combined_seurat, pattern = "^MT-")

# Defin a horizontal line at Y == 5 indicate empirical mito% threashold
horizontal_line <- geom_hline(yintercept = 10, linetype = "dashed", color = "red", size = 1)

# Percent MT per CONDITION (Control vs TBI)
p_condition <- VlnPlot(combined_seurat, features = c("percent.mt"), group.by = "Condition", pt.size = 0) +
  theme(
    axis.title.x = element_blank(), 
    legend.position = "none" # Hide legend since x-axis already shows the names
  ) + 
  ggtitle("Mitochondrial % by Condition") + 
  horizontal_line + 
  ylab("Mitochondrial Transcript (%)") + 
  annotate("text", x = 0.5, y = 12, label = "mito % = 10", color = "black", hjust = 0, size = 5)

# Percent MT per GSM
p_sample <- VlnPlot(combined_seurat, features = c("percent.mt"), group.by = "orig.ident", pt.size = 0) +
  theme(
    axis.title.x = element_blank(), 
    # Rotate x-axis labels 45 degrees so they don't overlap
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none" # Hide legend since x-axis already shows the names
  ) + vertical_line + 
  ggtitle("Mitochondrial % by Sample") + 
  horizontal_line + 
  ylab("Mitochondrial Transcript (%)") + 
  annotate("text", x = 0.5, y = 12, label = "mito % = 10", color = "black", hjust = 0, size = 5)

# 4. Combine them
final_mt_plot <- p_condition / p_sample
final_mt_plot

Figure 3: Mitochondrial transcript percentage grouped by GSMs. The Y-axis indicate mitochondrial transcript percentage; The X-axis of top panel indicate condition (Control v.s.); The X axis of bottom panel indicate GSM samples; The horizontal red dash line indicate empirical Mitochondrial transcript percentage threshold = 10%; The vertical red dash line at bottom panel separate control (left side of red line), and treatment (right side of red line).

However, the Mitochondrial transcript percentage is way higher than expected in all samples and the original paper didn’t assess the Mitochondrial transcript percentage at all. One clue that may explain this unexpected high Mitochondrial transcript percentage from their Limitations of the study is control tissues were post-mortem sample and tbi tissues were surgically evacuated rare case post-impact. Thus the integrity of samples were not at ideal condition and expected to contain damaged cells.

On the other hands, mitochondrial transcript percentage aligned with previous depth and coverage among samples, where control group tend to be more consistent. A hypothesis is that Traumatic Brain Injury (TBI) is a broad term that include various brain damage result from external force, and the intensity and exact pathology may vary among different tbi sample, reflected by sequence depth and coverage. Noticeably, GSM6376814 has relatively higher mitochondrial transcript percentage, and recall that it’s also the sample having lowest depth and coverage, which might imply existing technical variance and/or batch effect that could bias downstream analysis.

Mapping to HUGO symbols

Check the current identifier

head(rownames(combined_seurat))
[1] "MIR1302-2HG" "FAM138A"     "OR4F5"       "AL627309.1" 
[5] "AL627309.3"  "AL627309.2" 

At a glance, it’s HUGO symbol. To verify, we can query those symbol against org.Hs.eg.db.

# Install org.Hs.eg.db
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  BiocManager::install("org.Hs.eg.db")
}

# Attach org.Hs.eg.db to current session
library(org.Hs.eg.db)

# Check how many of them exist in the official symbol database
valid_symbols <- mapIds(org.Hs.eg.db, 
                        keys = rownames(combined_seurat), 
                        column = "SYMBOL", 
                        keytype = "SYMBOL", 
                        multiVals = "first")

# Calculate the percentage of matches
sum(!is.na(valid_symbols)) / length(rownames(combined_seurat)) * 100
[1] 100

100% valid symbols confirmed the symbol is acually HUGO symbol.

Cleaning

First we remove GSM6376814, since it has the lowest depth, coverage, and highest mitochondrial transcript percentage, thus pron to contaminate downstream analysis.

combined_seurat <- subset(combined_seurat, subset = orig.ident != "GSM6376814")

Based on Seurat and Heumos et al, 2023, they simply applied hard threshold on mitochondrial transcript percentage, 5% and 8%, at their quality control step. Additionally, refer to previous Figure 3, I empirically choice the threshold to be 10%.

combined_seurat <- subset(combined_seurat, subset = percent.mt <= 10)

In the end, 2887/24621 = 11.7% cells were removed. Compare with original paper, they removed 9511/24621 = 38.6% cells. I suppose part of the reason they apply such restrict quality control is they are aware of the low integrity of samples.

# 1. Mito % Plot (After Cleaning)
p_mito_clean <- VlnPlot(
  combined_seurat, 
  features = "percent.mt", 
  group.by = "orig.ident", 
  pt.size = 0
) + 
  vertical_line + 
  ggtitle("Mitochondrial % per Sample (After Cleaning)") + 
  ylab("Mitochondrial Transcript (%)") + 
  scale_y_continuous(limits = c(0, 10)) + # Zoom in since max is now <= 10
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1), 
    legend.position = "none",
    axis.title.x = element_blank()
  ) +
  coord_fixed(ratio = 0.5)

# 2. Total Cells Plot (Your existing code)
meta_data <- combined_seurat@meta.data 
p_cells <- ggplot(meta_data, aes(x = orig.ident, fill = orig.ident)) +
  geom_bar(color = "black") +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  labs(title = "Total Cells per Sample (After Cleaning)", y = "Number of Cells", x = NULL) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  ) +
  coord_fixed(ratio = 0.001) +
  scale_y_continuous(limits = c(0, 4000)) + 
  vertical_line

# 3. Combine them vertically
final_clean_plot <- p_mito_clean / p_cells
final_clean_plot

Figure 4: Number of cells after cleaning grouped by GSMs. The Y-axis indicate number of cells; The X-axis indicate GSMs; The vertical red dash line separate control (left side of red line), and tbi (right side of red line).

Normalization

Differ from RNA-seq and bulk RNA-seq normalization we covered in class, sc/snRNA-seq data is zero-inflated, most of gene’s counts is 0 in most of cells. Thus two of the method we covered in class, TMM and RLE, are not suitable for sc/snRNA-seq because they both assume most of the gene are not differentially expressed, zero-inflated counts data will yield inflated fold change when ever an algorithm is trying to calculate fold change (Lytal et al, 2020).

Library size based Log-Normalization

The simple normalization by library size still work as intended and won’t affect by inflated zeros and widely utilized in sing-cell transcriptome field as a baseline (Ge et al, 2025, Lytal et al, 2020, Amezquita et al, 2019). In Seurat, it’s wrapped as NormalizeData() function. Note the default scale.factor is set to be 10000 to account for sc/snRNA-seq’s relative small reads per cell compare with bulk RNA-seq.

For visualization, I choice first randomly sample one cell from each GSM, exclude genes with 0 counts, and plot the pre and post normalized counts on Y-axis, sampled cell on X-axis. The reason of no not plot normalized counts per GSM is because in sc/snRNA-seq, the counts is normalized by the total reads of a cell rather than GSM, thus plot normalized counts per GSM won’t be so meaningful.

library(tidyr)
library(tibble)

# Collapse the split sample layers into a single unified 'counts' layer
combined_seurat <- JoinLayers(combined_seurat)

# 1. Randomly sample exactly ONE cell per GSM (orig.ident)
set.seed(42) # Set seed for reproducibility
sampled_cells <- combined_seurat@meta.data %>%
  rownames_to_column("Cell_ID") %>%
  group_by(orig.ident) %>%
  sample_n(1) %>%
  pull(Cell_ID)

# Subset the Seurat object to only contain these representative cells
single_cell_seurat <- subset(combined_seurat, cells = sampled_cells)

# 2. Extract Pre-Normalization Data (Raw Counts)
raw_matrix <- as.matrix(GetAssayData(single_cell_seurat, layer = "counts"))

df_raw <- as.data.frame(raw_matrix) %>%
  pivot_longer(cols = everything(), names_to = "Cell_ID", values_to = "Count") %>%
  filter(Count > 0) %>%
  mutate(Sample = single_cell_seurat@meta.data[Cell_ID, "orig.ident"])

# Plot Pre-Normalization
p_pre_norm <- ggplot(df_raw, aes(x = Sample, y = Count, fill = Sample)) +
  geom_violin(scale = "width", color = "black", alpha = 0.8) + 
  ggtitle("Pre-Normalization: 1 Cell per GSM (Raw Counts)") + 
  ylab("Expression (Linear Counts)") +
  theme_classic() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none",
    axis.title.x = element_blank()
  ) + vertical_line + 
  scale_y_continuous(limits = c(0, 200))

# 3. Normalization 
combined_seurat <- NormalizeData(combined_seurat, normalization.method = "LogNormalize", scale.factor = 10000)
Performing log-normalization
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
# Re-subset to get the updated normalized data for those exact same cells
single_cell_seurat <- subset(combined_seurat, cells = sampled_cells)

# 4. Extract Post-Normalization Data (Log-Normalized)
# NormalizeData automatically outputs to a unified 'data' layer, so this is safe
norm_matrix <- as.matrix(GetAssayData(single_cell_seurat, layer = "data"))

df_norm <- as.data.frame(norm_matrix) %>%
  pivot_longer(cols = everything(), names_to = "Cell_ID", values_to = "Normalized_Count") %>%
  filter(Normalized_Count > 0) %>%
  mutate(Sample = single_cell_seurat@meta.data[Cell_ID, "orig.ident"])

# Plot Post-Normalization
p_post_norm <- ggplot(df_norm, aes(x = Sample, y = Normalized_Count, fill = Sample)) +
  geom_violin(scale = "width", color = "black", alpha = 0.8) + 
  ggtitle("Post-Normalization: 1 Cell per GSM (Log-Normalized)") + 
  ylab("Expression (Log1p scale)") + 
  theme_classic() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none",
    axis.title.x = element_blank()
  ) + vertical_line + 
  scale_y_continuous(limits = c(0, 6))

# 5. Combine plots
final_norm_plot <- p_pre_norm / p_post_norm
final_norm_plot

Figure 5: Pre and post normalized counts per cell ( randomly sample 1 cell per GSM). The Y-axis of upper panel indicate raw counts; The Y-axis of bottom panel indicate Log-Normalized counts; The X-axis indicate the GSMs that cell is sampled from; The vertical red dash line separate control (left side of red line), and tbi (right side of red line).

The normalized counts yield interesting vases shape, a flat and wide base with numbers of beads attached upon. The flat base is likely because of the zero-inflated nature of sc/snRNA-seq, even though I exclude all gene that has count == 0, most gene still have counts close to zero, following negative binomial distribution. However, I don’t have a well constructed hypothesis to explain those beads attached, maybe because the raw input (raw counts) in discrete, the log-Normalized counts will also be discrete. The relative height of the base is reflecting sequence depth of those cell. When those counts close to zero were normalized by total read of a cell, the larger the total read, the smaller the normalized count. Thus the higher the base, the lower the sequence depth.

Note : X-axis is the GSM those cells sampled from, not GSM itself. So we can’t infer sequence depth of GSM from Figure 5.

Obtain normalized dataframe

NormalizeData()stored normalized counts in data layer. Simply call Seurat::GetAssayData() on data layer to obtain normalized counts, and convert to data.frame use as.data.frame().

# extract normalized data
norm_matrix <- GetAssayData(combined_seurat, layer = "data")

# Check dimensions (Genes x Cells)
dim(norm_matrix)
[1] 33538 21734
# Check rowname
head(rownames(norm_matrix))
[1] "MIR1302-2HG" "FAM138A"     "OR4F5"       "AL627309.1" 
[5] "AL627309.3"  "AL627309.2" 
# Check colname
head(colnames(norm_matrix))
[1] "GSM6376803_AAACCCAGTTTCGGCG-1"
[2] "GSM6376803_AAACGAAAGACCATTC-1"
[3] "GSM6376803_AAACGAACAGAGGTAC-1"
[4] "GSM6376803_AAACGAACATGTGGTT-1"
[5] "GSM6376803_AAACGAATCGGATACT-1"
[6] "GSM6376803_AAACGAATCTGTCCGT-1"
# convert to data.frame
# Comment out to skip due to memory constrain
# norm_df <- as.data.frame(norm_matrix)
# Check dimensions (Genes x Cells)
# dim(norm_df)

The final result is in dgCMatrix instead of data.frame due to memory constrain. Convert to data.frame require additional ~5.4 GB of RAM.

---
title: "Assignment #1"
author: "Jiaqi Ma"
date: "2026-02-06"
output:
  html_notebook:
    toc: true
    toc_depth: 2
    toc_float: true
    code_folding: none
  html_document:
    toc: true
    toc_depth: 2
    df_print: paged
bibliography: BCB420.bib
link-citations: true
---

# Data Download

------------------------------------------------------------------------

-   My choice of dataset is `GSE209552`.
-   Note: `GSE209552` is single nucleus RNA-seq (`snRNA-seq`), a varient of `scRNA-seq` that only sequence nucleus `RNA`.

```{r}
# Define global constant
GSE_NUM <- "GSE209552"
```

## GEOmetadb setup

To setup [GEOmetadb Tutorial](https://www.bioconductor.org/packages/release/bioc/html/GEOmetadb.html), follow [GEOmetadb Tutorial Chapter 3](https://bcb420-2026.github.io/Exercise_Finding_Expression_data/setting-up-geometadb.html)

```{r}
# Install required package if not installed
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!requireNamespace("GEOmetadb", quietly = TRUE))
  BiocManager::install("GEOmetadb")
if (!requireNamespace("DBI", quietly = TRUE))
  install.packages("DBI")
if (!requireNamespace("RSQLite", quietly = TRUE))
  install.packages("RSQLite")
if (!requireNamespace("dplyr", quietly = TRUE))
  install.packages("dplyr")
if (!requireNamespace("ggplot2", quietly = TRUE))
  install.packages("ggplot2")

# Attach packages
library(GEOmetadb)
library(DBI)
library(RSQLite)

# Download SQLite database if not exist
if (!file.exists('GEOmetadb.sqlite')) getSQLiteFile()
file.info('GEOmetadb.sqlite')

# Connect to database
con <- dbConnect(SQLite(), 'GEOmetadb.sqlite')
```

## Download whole GSE209552

Download data, follow [Identifier Mapping Tutorial](https://bcb420-2026.github.io/Example_Student/id_mapping/getting-data-from-geo.html). The helper function defined below is modified based on [Identifier Mapping Tutorial's helper function](https://github.com/bcb420-2026/Example_Student/blob/main/in_class_exercises/id_mapping/fetch_geo_supp.R) to avoid duplicated download.

```{r}
# Define helper function
fetch_geo_supp <- function(gse, destdir = "data") {
  # Calculate the actual path where GEOquery puts the files
  target_path <- file.path(destdir, gse)
  
  # Check if directory exists AND contains files
  if (dir.exists(target_path) && length(list.files(target_path)) > 0) {
    message(paste("Data found at", target_path, "- Skipping download."))
    return(invisible(target_path))
  }
  
  # If missing or empty, proceed with download
  message(paste("Downloading", gse, "to", destdir, "..."))
  dir.create(destdir, showWarnings = FALSE, recursive = TRUE)
  GEOquery::getGEOSuppFiles(GEO = gse, baseDir = destdir, makeDirectory = TRUE)
  
  invisible(target_path)
}

# Download
fetch_geo_supp(gse = GSE_NUM)
```

## Construct snRNA-seq only reference list

This GSE contain both `snRNA-seq` and `bulk RNA-seq` sample. Removal of bulk RNA-seq is needed to avoid significant batch effect due to their technical difference. Although the number of samples is reduced to 17, it's still sufficient amount.

Refer to the [original paper](https://doi.org/10.1016/j.celrep.2023.113395)'s method section, only the `snRNA-seq` was done by `Illumina NextSeq6000` platform, [GPL24676](https://www-ncbi-nlm-nih-gov.myaccess.library.utoronto.ca/geo/query/acc.cgi?acc=GPL24676). We can use this clue to exclude all `bulk RNA-seq`.

Since all GSMs of this study are packed in one `GSE209552_RAW.tar` file in GEO, we can't indicate which platform to include/exclude at download step. Thus we will construct a list of `GPL24676` GSMs for downstream reference by query `GEOmetabd`.

```{r}
# Define desired platform
snRNA_platform <- "GPL24676"

# Construct the query
# We join the 'link' table (gse_gsm) with the 'sample' table (gsm) 
# to check the platform (gpl) column.
sql_filter <- paste0(
  "SELECT t1.gsm ",
  "FROM gse_gsm t1 ",
  "JOIN gsm t2 ON t1.gsm = t2.gsm ",
  "WHERE t1.gse = '", GSE_NUM, "' ",
  "AND t2.gpl = '", snRNA_platform, "'"
)

# Get the list of pure snRNA-seq samples
sn_gsm_list <- dbGetQuery(con, sql_filter)$gsm

# View results
print(paste("Found", length(sn_gsm_list), "snRNA-seq samples."))
sn_gsm_list
```

# Access data quality

-   The following analysis is build on [Seurat](https://satijalab.org/seurat/) framework and referring [Heumos et al, 2023](https://doi.org/10.1038/s41576-023-00586-w) for high-level idea.

```{r}
# Install Seurat if not
if (!requireNamespace("Seurat", quietly = TRUE))
  install.packages("Seurat")
```

## Reorganize snRNA-seq files

The raw count data was stored in `GSE209552_RAW.tar` file, thus need to be unpacked using `untar()` function.

```{r}
# Attach packages
library(Seurat)
library(data.table)
library(Matrix)
library(dplyr)
library(stringr)

# Define data paths
tar_file <- "data/GSE209552/GSE209552_RAW.tar"
extract_dir <- "data/GSE209552/raw_files"

# Define unpack helper function
unpack_geo_tar <- function(tar_path, dest_dir) {
  if (dir.exists(dest_dir) && length(list.files(dest_dir)) > 0) {
    message(paste("📂 Extracted files found at", dest_dir, "- Skipping unpack."))
    return(invisible(dest_dir))
  }
  if (!file.exists(tar_path)) {
    stop(paste("❌ Tar file not found:", tar_path))
  }
  message(paste("📦 Unpacking", basename(tar_path), "..."))
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  untar(tar_path, exdir = dest_dir)
  message("✅ Unpacking complete.")
  return(invisible(dest_dir))
}

# Run the unpacker
unpack_geo_tar(tar_file, extract_dir)
```

The unpacked file is very messy, contain all 10x genomics `snRNA-seq` files, `bulk RNA-seq` files, subsetted Transposable Elements file, clustered files, and normalized files. Among those, only 10x genomics `snRNA-seq` is what we need.

To extract, we utilize previously defined `sn_gsm_list`, loop over it, exclude any file contain `_TE_`, indicate Transposable Elements, only extract files end with either `barcode`, `matrix`, or `features`. Lastly, trim the file name to `barcode`, `matrix`, `features` only, to meet the expected naming format of [Seurat](https://satijalab.org/seurat/)'s `Read10X()` function.

```{r}
# Reorganize all snRNA-seq 10x files in to a new directory
dest_dir <- "data/GSE209552/10x_organized"    # Where they should go
dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)

# --- MAIN LOOP ---
message(paste("🚀 Organizing files for", length(sn_gsm_list), "samples..."))

skipped_count <- 0
moved_count   <- 0

for (gsm in sn_gsm_list) {
  
  # 1. Define the specific destination path for this sample
  sample_subdir <- file.path(dest_dir, gsm)
  
  # 2. CHECK: Does it already exist?
  if (dir.exists(sample_subdir) && length(list.files(sample_subdir)) >= 3) {
    skipped_count <- skipped_count + 1
    next
  }
  
  # 3. Create directory
  dir.create(sample_subdir, showWarnings = FALSE)
  
  # 4. Find source files (Exclude the 'TE' files)
  pattern <- paste0("^", gsm)
  files_to_move <- list.files(extract_dir, pattern = pattern, full.names = TRUE)
  files_to_move <- files_to_move[!grepl("_TE_", files_to_move)]
  
  # Safety Checks
  if (length(files_to_move) == 0) {
    warning(paste("⚠️ Skipping", gsm, "- No source files found."))
    next
  }
  
  # 5. MOVE & RENAME
  # We identify specific files to ensure we rename them correctly for Seurat
  f_matrix   <- files_to_move[grep("matrix", files_to_move, ignore.case = TRUE)]
  f_barcodes <- files_to_move[grep("barcodes", files_to_move, ignore.case = TRUE)]
  f_features <- files_to_move[grep("features|genes", files_to_move, ignore.case = TRUE)]
  
  # Verify we found exactly 1 of each before moving
  if (length(f_matrix) == 1 && length(f_barcodes) == 1 && length(f_features) == 1) {
    
    # Rename Matrix -> matrix.mtx.gz
    file.rename(f_matrix, file.path(sample_subdir, "matrix.mtx.gz"))
    
    # Rename Barcodes -> barcodes.tsv.gz
    file.rename(f_barcodes, file.path(sample_subdir, "barcodes.tsv.gz"))
    
    # Rename Genes/Features -> features.tsv.gz
    file.rename(f_features, file.path(sample_subdir, "features.tsv.gz"))
    
    message(paste("  ✅ Standardized & Moved:", gsm))
    moved_count <- moved_count + 1
    
  } else {
    warning(paste("  ⚠️ Skipping", gsm, "- Could not find unique 10x triplet to standardize."))
  }
}

# --- SUMMARY ---
message("🎉 Organization Complete.")
message(paste("  🔹 Samples moved & standardized:", moved_count))
message(paste("  🔸 Samples skipped (already done):", skipped_count))
```

## Load snRNA-seq files into R using Seurat

Loading of expression data is done by iteratively call [Seurat](https://satijalab.org/seurat/)'s `Read10X()` and `CreateSeuratObject()` function over all GSMs. They cooperatively first read a 10X genomics structured file and create a `Seurat Object`, which in the basic unit for any data manipulation in [Seurat](https://satijalab.org/seurat/).

```{r}
# Get the list of sample subdirectories
sample_dirs <- list.dirs(dest_dir, recursive = FALSE)

# Define a list storing seurat obj of all GSMs
seurat_list <- list()

message(paste("🚀 Loading", length(sample_dirs), "samples..."))
for (dir in sample_dirs) {
  # Extract GSM ID from the folder name (e.g., "GSM6385438")
  gsm_id <- basename(dir)
  
  # A. Read 10x Data
  # Because you renamed the files, Read10X finds them automatically!
  counts <- Read10X(data.dir = dir)
  
  # B. Create Seurat Object
  seurat_list[[gsm_id]] <- CreateSeuratObject(
    counts = counts, 
    project = gsm_id, 
    min.cells = 0, 
    min.features = 0
  )
  
  message(paste("  ✅ Loaded:", gsm_id))
}
length(seurat_list)
```

However, the metadata (contain condition information) need to fetched from [GEOmetadb](https://www.bioconductor.org/packages/release/bioc/html/GEOmetadb.html), and extract from `GSM` table's `characteristics_ch1` column, and then insert into corresponding Seurat object's `meta.data` layer.

```{r}
# Load GSMs metadata from GEOmetadb
gsm_list_string <- paste(paste0("'", sn_gsm_list, "'"), collapse = ",")
sql <- paste0(
  "SELECT gsm, title, characteristics_ch1 ",
  "FROM gsm ",
  "WHERE gsm IN (", gsm_list_string, ")"
)
metadata <- dbGetQuery(con, sql)

# Merge the condition metadata to corresponding Seurat obj's metadata
metadata$Condition <- sub(".*condition:\\s*([^;]+).*", "\\1", metadata$characteristics_ch1, ignore.case = TRUE)

# Loop and assign
for (gsm in names(seurat_list)) {
  # Find the condition for this GSM
  val <- metadata$Condition[metadata$gsm == gsm]
  
  # Assign to Seurat object (handling missing values)
  if (length(val) > 0) {
    seurat_list[[gsm]]$Condition <- val
  } else {
    seurat_list[[gsm]]$Condition <- "Unknown"
  }
}

# Final Merge all samples into one seurat obj
combined_seurat <- merge(
 x = seurat_list[[1]], 
 y = seurat_list[-1], 
 add.cell.ids = names(seurat_list), 
 project = "GSE209552"
)
# Verify
table(combined_seurat$Condition)

# Delete `seurat_list` to free memory
rm(seurat_list)
```

## Data exploriation

Plot the following statistics:

-   Number of unique identifier (infer coverage)
-   Number of total amount of identifier (infer depth)
-   Number of samples
-   Number of cells

```{r qc-plot-per-condition, fig.width=14, fig.height=14, fig.align='center', dpi=300, out.width="100%"}
library(ggplot2)
library(patchwork)

theme_set(theme_classic(base_size = 26))

# 1. Coverage (nFeature_RNA)
p_cov <- VlnPlot(
  combined_seurat, 
  features = "nFeature_RNA", 
  group.by = "Condition", 
  pt.size = 0
) + 
  ggtitle("Coverage") + 
  ylab("nFeature_RNA")

# 2. Depth (nCount_RNA) with adjusted axis
p_depth <- VlnPlot(
  combined_seurat, 
  features = "nCount_RNA", 
  group.by = "Condition", 
  pt.size = 0
) + 
  ggtitle("Depth") + 
  ylab("nCount_RNA") +
  scale_y_continuous(limits = c(0, 50000)) 

# Combine them to recreate p_metrics
p_metrics <- p_cov | p_depth

# Number of cells among conditions
meta_data <- combined_seurat@meta.data

# A. Bar plot: Total Number of Cells per Condition
p_cells <- ggplot(meta_data, aes(x = Condition, fill = Condition)) +
  geom_bar(color = "black") +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  labs(title = "Total Cells per Group", y = "Number of Cells", x = NULL) +
  theme(legend.position = "none")

# Number of samples among conditions
p_samples <- meta_data %>%
  group_by(Condition) %>%
  summarize(Sample_Count = n_distinct(orig.ident)) %>%
  ggplot(aes(x = Condition, y = Sample_Count, fill = Condition)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = Sample_Count), vjust = -0.5) +
  theme_classic() +
  labs(title = "Total Samples (GSMs) per Group", y = "Number of Samples (GSMs)", x = NULL) +
  theme(legend.position = "none")

# Merge plots
final_plot <- p_metrics / (p_samples | p_cells)

# Adjust font size
final_plot <- final_plot & theme_classic(base_size = 16)
final_plot
```

**Figure 1:** Overall statistics of dataset `GSE209552` grouped by condition (Control v.s. tbi). The upper-left panel indicate the number of unique identifier; The upper-right panel indicate the total amount of identifier, the maximum amount of identifier was limited up to 50000 for the sake of aesthetics; The lower-left panel indicate number of samples (GSMs); The lower-right panel indicate the number of cells"; X-axis of all four panel indicate condition.

Generally, control group have higher depth and coverage, but less sample and cells.

Then plot the statistics below per sample (GSM):

-   Number of unique identifier (infer coverage)
-   Number of total amount of identifier (infer depth)
-   Number of cells

```{r qc-plot-per-sample, fig.width=14, fig.height=14, fig.align='center', dpi=300, out.width="100%"}

# Define the vertical line style once to keep it consistent
vertical_line <- geom_vline(xintercept = 5.5, linetype = "dashed", color = "red", size = 1)

# 1. Coverage (nFeature_RNA)
p_cov <- VlnPlot(
  combined_seurat, 
  features = "nFeature_RNA", 
  group.by = "orig.ident", 
  pt.size = 0
) + 
  ggtitle("Coverage") + 
  ylab("nFeature_RNA") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none",
    axis.title.x = element_blank() # Often cleaner to remove x-label if text is rotated
  ) + vertical_line

# 2. Depth (nCount_RNA) with adjusted axis
p_depth <- VlnPlot(
  combined_seurat, 
  features = "nCount_RNA", 
  group.by = "orig.ident", 
  pt.size = 0
) + 
  ggtitle("Depth") + 
  ylab("nCount_RNA") +
  scale_y_continuous(limits = c(0, 50000)) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none",
    axis.title.x = element_blank()
  ) + vertical_line

# A. Bar plot: Total Number of Cells per Sample
meta_data <- combined_seurat@meta.data # Ensure meta_data is defined
p_cells <- ggplot(meta_data, aes(x = orig.ident, fill = orig.ident)) +
  geom_bar(color = "black") +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  labs(title = "Total Cells per Sample", y = "Number of Cells", x = NULL) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  ) +
  coord_fixed(ratio = 0.0005)  +
  scale_y_continuous(limits = c(0, 6000)) + vertical_line

# Merge plots
final_plot <- p_cov / p_depth / p_cells + 
  plot_layout(heights = c(1, 1, 1))

# Adjust font size (and re-apply theme elements if needed by the base theme)
final_plot <- final_plot & theme_classic(base_size = 16) & 
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )

final_plot
```

**Figure 2:** Overall statistics per sample (GSM). The top panel indicate number of unique identifier; The middle panel indicate number of total identifier; The bottom panel indicate number of cells; X-axis of all three panel indicate samples (GSMs); The vertical red dash line separate control (left side of red line), and tbi (right side of red line).

In general, the sequence depth and coverage is more consistence among control group. Tbi group has more variated depth and coverage with `GSM637681` being the sample of lowest depths and coverage. The number of cells within different sample also varies, from highest 3708 (`GSM6376822`) to lowest 358 (`GSM6376820`). This imbalance in cell number may result in downstream analysis dominate by sample which contain more cells.

Another common quality control step in `scRNA-seq` is to remove `damaged cells`. When damaged cells were sequenced or damaged during sequencing, the expression intensity will scaled down due to leaked RNA, thus significantly bias downstream clustering, annotation, and differential gene expression analysis. A common method to conduct this filtering is using `mitochondrial transcript percentage`, based on this idea that if a cell has leaked RNA, the `mitochondrial transcript percentage` will be relatively higher since mitochondria is relatively large and less likely to leak. Empirically, the threshold is set to be `5% ~ 10%`

However, when it comes to `snRNA-seq`, `mitochondrial transcript percentage` is expected to be `0%` because of the exclusion of cytoplasm, and any none-zero `mitochondrial transcript percentage` indicate incomplete exclusion of cytoplasm [Amezquita et al, 2020](https://doi.org/10.1038/s41592-019-0654-x). Thus we may want to choice a more strict threshold for `snRNA-seq`, like `1% ~ 2%`, to match the wet-lab design while tolerate low-quality cell to certain extent.

```{r mt-plot, fig.width=14, fig.height=14, fig.align='center', dpi=300, out.width="100%"}
# Use seurat PercentageFeatureSet() function
# Create a new entry in meta.data layer storing mitochondrial transcript percentage
combined_seurat[["percent.mt"]] <- PercentageFeatureSet(combined_seurat, pattern = "^MT-")

# Defin a horizontal line at Y == 5 indicate empirical mito% threashold
horizontal_line <- geom_hline(yintercept = 10, linetype = "dashed", color = "red", size = 1)

# Percent MT per CONDITION (Control vs TBI)
p_condition <- VlnPlot(combined_seurat, features = c("percent.mt"), group.by = "Condition", pt.size = 0) +
  theme(
    axis.title.x = element_blank(), 
    legend.position = "none" # Hide legend since x-axis already shows the names
  ) + 
  ggtitle("Mitochondrial % by Condition") + 
  horizontal_line + 
  ylab("Mitochondrial Transcript (%)") + 
  annotate("text", x = 0.5, y = 12, label = "mito % = 10", color = "black", hjust = 0, size = 5)

# Percent MT per GSM
p_sample <- VlnPlot(combined_seurat, features = c("percent.mt"), group.by = "orig.ident", pt.size = 0) +
  theme(
    axis.title.x = element_blank(), 
    # Rotate x-axis labels 45 degrees so they don't overlap
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none" # Hide legend since x-axis already shows the names
  ) + vertical_line + 
  ggtitle("Mitochondrial % by Sample") + 
  horizontal_line + 
  ylab("Mitochondrial Transcript (%)") + 
  annotate("text", x = 0.5, y = 12, label = "mito % = 10", color = "black", hjust = 0, size = 5)

# 4. Combine them
final_mt_plot <- p_condition / p_sample
final_mt_plot
```

**Figure 3:** Mitochondrial transcript percentage grouped by GSMs. The Y-axis indicate mitochondrial transcript percentage; The X-axis of top panel indicate condition (Control v.s.); The X axis of bottom panel indicate GSM samples; The horizontal red dash line indicate empirical Mitochondrial transcript percentage threshold = 10%; The vertical red dash line at bottom panel separate control (left side of red line), and treatment (right side of red line).

However, the `Mitochondrial transcript percentage` is way higher than expected in all samples and the [original paper](https://doi.org/10.1016/j.celrep.2023.113395) didn't assess the `Mitochondrial transcript percentage` at all. One clue that may explain this unexpected high `Mitochondrial transcript percentage` from their `Limitations of the study` is control tissues were post-mortem sample and tbi tissues were surgically evacuated rare case post-impact. Thus the integrity of samples were not at ideal condition and expected to contain damaged cells.

On the other hands, `mitochondrial transcript percentage` aligned with previous depth and coverage among samples, where control group tend to be more consistent. A hypothesis is that Traumatic Brain Injury (TBI) is a broad term that include various brain damage result from external force, and the intensity and exact pathology may vary among different tbi sample, reflected by sequence depth and coverage. Noticeably, `GSM6376814` has relatively higher mitochondrial transcript percentage, and recall that it's also the sample having lowest depth and coverage, which might imply existing technical variance and/or batch effect that could bias downstream analysis.

## Mapping to HUGO symbols

Check the current identifier

```{r}
head(rownames(combined_seurat))
```

At a glance, it's `HUGO` symbol. To verify, we can query those symbol against `org.Hs.eg.db`.

```{r}
# Install org.Hs.eg.db
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  BiocManager::install("org.Hs.eg.db")
}

# Attach org.Hs.eg.db to current session
library(org.Hs.eg.db)

# Check how many of them exist in the official symbol database
valid_symbols <- mapIds(org.Hs.eg.db, 
                        keys = rownames(combined_seurat), 
                        column = "SYMBOL", 
                        keytype = "SYMBOL", 
                        multiVals = "first")

# Calculate the percentage of matches
sum(!is.na(valid_symbols)) / length(rownames(combined_seurat)) * 100
```

`100%` valid symbols confirmed the symbol is acually `HUGO` symbol.

## Cleaning

First we remove `GSM6376814`, since it has the lowest depth, coverage, and highest `mitochondrial transcript percentage`, thus pron to contaminate downstream analysis.

```{r}
combined_seurat <- subset(combined_seurat, subset = orig.ident != "GSM6376814")
```

Based on [Seurat](https://satijalab.org/seurat/) and [Heumos et al, 2023](https://doi.org/10.1038/s41576-023-00586-w), they simply applied hard threshold on mitochondrial transcript percentage, `5%` and `8%`, at their quality control step. Additionally, refer to previous `Figure 3`, I empirically choice the threshold to be `10%`.

```{r}
combined_seurat <- subset(combined_seurat, subset = percent.mt <= 10)
```

In the end, `2887/24621 = 11.7%` cells were removed. Compare with [original paper](https://doi.org/10.1016/j.celrep.2023.113395), they removed `9511/24621 = 38.6%` cells. I suppose part of the reason they apply such restrict quality control is they are aware of the low integrity of samples.

```{r filtered-plot, fig.width=14, fig.height=10, fig.align='center', dpi=300, out.width="100%"}
# 1. Mito % Plot (After Cleaning)
p_mito_clean <- VlnPlot(
  combined_seurat, 
  features = "percent.mt", 
  group.by = "orig.ident", 
  pt.size = 0
) + 
  vertical_line + 
  ggtitle("Mitochondrial % per Sample (After Cleaning)") + 
  ylab("Mitochondrial Transcript (%)") + 
  scale_y_continuous(limits = c(0, 10)) + # Zoom in since max is now <= 10
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1), 
    legend.position = "none",
    axis.title.x = element_blank()
  ) +
  coord_fixed(ratio = 0.5)

# 2. Total Cells Plot (Your existing code)
meta_data <- combined_seurat@meta.data 
p_cells <- ggplot(meta_data, aes(x = orig.ident, fill = orig.ident)) +
  geom_bar(color = "black") +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  labs(title = "Total Cells per Sample (After Cleaning)", y = "Number of Cells", x = NULL) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  ) +
  coord_fixed(ratio = 0.001) +
  scale_y_continuous(limits = c(0, 4000)) + 
  vertical_line

# 3. Combine them vertically
final_clean_plot <- p_mito_clean / p_cells
final_clean_plot
```

**Figure 4:** Number of cells after cleaning grouped by GSMs. The Y-axis indicate number of cells; The X-axis indicate GSMs; The vertical red dash line separate control (left side of red line), and tbi (right side of red line).

# Normalization

Differ from RNA-seq and bulk RNA-seq normalization we covered in class, `sc/snRNA-seq` data is `zero-inflated`, most of gene's counts is `0` in most of cells. Thus two of the method we covered in class, `TMM` and `RLE`, are not suitable for `sc/snRNA-seq` because they both assume most of the gene are not differentially expressed, `zero-inflated` counts data will yield inflated fold change when ever an algorithm is trying to calculate fold change ([Lytal et al, 2020](https://doi.org/10.3389/fgene.2020.00041)).

## Library size based Log-Normalization

The simple normalization by library size still work as intended and won't affect by inflated zeros and widely utilized in sing-cell transcriptome field as a baseline ([Ge et al, 2025](https://doi.org/10.1371/journal.pone.0335102), [Lytal et al, 2020](https://doi.org/10.3389/fgene.2020.00041), [Amezquita et al, 2019](https://doi.org/10.1038/s41592-019-0654-x)). In `Seurat`, it's wrapped as [NormalizeData()](https://satijalab.org/seurat/reference/normalizedata) function. Note the default `scale.factor` is set to be `10000` to account for `sc/snRNA-seq`'s relative small reads per cell compare with `bulk RNA-seq`.

For visualization, I choice first randomly sample one cell from each `GSM`, exclude genes with `0` counts, and plot the pre and post normalized counts on Y-axis, sampled cell on X-axis. The reason of no not plot normalized counts per `GSM` is because in `sc/snRNA-seq`, the counts is normalized by the total reads of a cell rather than `GSM`, thus plot normalized counts per `GSM` won't be so meaningful.

```{r pre-post-norm-plot, fig.width=14, fig.height=10, fig.align='center', dpi=300, out.width="100%"}
library(tidyr)
library(tibble)

# Collapse the split sample layers into a single unified 'counts' layer
combined_seurat <- JoinLayers(combined_seurat)

# 1. Randomly sample exactly ONE cell per GSM (orig.ident)
set.seed(42) # Set seed for reproducibility
sampled_cells <- combined_seurat@meta.data %>%
  rownames_to_column("Cell_ID") %>%
  group_by(orig.ident) %>%
  sample_n(1) %>%
  pull(Cell_ID)

# Subset the Seurat object to only contain these representative cells
single_cell_seurat <- subset(combined_seurat, cells = sampled_cells)

# 2. Extract Pre-Normalization Data (Raw Counts)
raw_matrix <- as.matrix(GetAssayData(single_cell_seurat, layer = "counts"))

df_raw <- as.data.frame(raw_matrix) %>%
  pivot_longer(cols = everything(), names_to = "Cell_ID", values_to = "Count") %>%
  filter(Count > 0) %>%
  mutate(Sample = single_cell_seurat@meta.data[Cell_ID, "orig.ident"])

# Plot Pre-Normalization
p_pre_norm <- ggplot(df_raw, aes(x = Sample, y = Count, fill = Sample)) +
  geom_violin(scale = "width", color = "black", alpha = 0.8) + 
  ggtitle("Pre-Normalization: 1 Cell per GSM (Raw Counts)") + 
  ylab("Expression (Linear Counts)") +
  theme_classic() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none",
    axis.title.x = element_blank()
  ) + vertical_line + 
  scale_y_continuous(limits = c(0, 200))

# 3. Normalization 
combined_seurat <- NormalizeData(combined_seurat, normalization.method = "LogNormalize", scale.factor = 10000)

# Re-subset to get the updated normalized data for those exact same cells
single_cell_seurat <- subset(combined_seurat, cells = sampled_cells)

# 4. Extract Post-Normalization Data (Log-Normalized)
# NormalizeData automatically outputs to a unified 'data' layer, so this is safe
norm_matrix <- as.matrix(GetAssayData(single_cell_seurat, layer = "data"))

df_norm <- as.data.frame(norm_matrix) %>%
  pivot_longer(cols = everything(), names_to = "Cell_ID", values_to = "Normalized_Count") %>%
  filter(Normalized_Count > 0) %>%
  mutate(Sample = single_cell_seurat@meta.data[Cell_ID, "orig.ident"])

# Plot Post-Normalization
p_post_norm <- ggplot(df_norm, aes(x = Sample, y = Normalized_Count, fill = Sample)) +
  geom_violin(scale = "width", color = "black", alpha = 0.8) + 
  ggtitle("Post-Normalization: 1 Cell per GSM (Log-Normalized)") + 
  ylab("Expression (Log1p scale)") + 
  theme_classic() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none",
    axis.title.x = element_blank()
  ) + vertical_line + 
  scale_y_continuous(limits = c(0, 6))

# 5. Combine plots
final_norm_plot <- p_pre_norm / p_post_norm
final_norm_plot
```

**Figure 5:** Pre and post normalized counts per cell ( randomly sample 1 cell per `GSM`). The Y-axis of upper panel indicate raw counts; The Y-axis of bottom panel indicate Log-Normalized counts; The X-axis indicate the GSMs that cell is sampled from; The vertical red dash line separate control (left side of red line), and tbi (right side of red line).

The normalized counts yield interesting vases shape, a flat and wide base with numbers of `beads` attached upon. The flat base is likely because of the `zero-inflated` nature of `sc/snRNA-seq`, even though I exclude all gene that has count == 0, most gene still have counts close to zero, following negative binomial distribution. However, I don't have a well constructed hypothesis to explain those `beads` attached, maybe because the raw input (raw counts) in discrete, the log-Normalized counts will also be discrete. The relative height of the base is reflecting sequence depth of those cell. When those counts close to zero were normalized by total read of a cell, the larger the total read, the smaller the normalized count. Thus the higher the base, the lower the sequence depth.

**Note :** X-axis is the `GSM` those cells sampled from, not `GSM` itself. So we can't infer sequence depth of `GSM` from **Figure 5**.

## Obtain normalized dataframe

`NormalizeData()`stored normalized counts in `data` layer. Simply call `Seurat::GetAssayData()` on `data` layer to obtain normalized counts, and convert to `data.frame` use `as.data.frame()`.

```{r}
# extract normalized data
norm_matrix <- GetAssayData(combined_seurat, layer = "data")

# Check dimensions (Genes x Cells)
dim(norm_matrix)

# Check rowname
head(rownames(norm_matrix))

# Check colname
head(colnames(norm_matrix))

# convert to data.frame
# Comment out to skip due to memory constrain
# The script below prompt to allocate 5.4 GM of RAM
# norm_df <- as.data.frame(norm_matrix)
# Check dimensions (Genes x Cells)
# dim(norm_df)
```

The final result is in `dgCMatrix` instead of `data.frame` due to memory constrain. Convert to `data.frame` require additional `~5.4 GB` of RAM.